# Multimodal Contrastive Learning

| Model | Organization | License | Tags | Downloads | Likes | Description |
|---|---|---|---|---|---|---|
| PE Core B16 224 | facebook | Apache-2.0 | Text-to-Image | 9,663 | 11 | The Perception Encoder is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across a range of visual tasks. |
| PE Core G14 448 | facebook | Apache-2.0 | Text-to-Image | 22.83k | 14 | The Perception Encoder (PE) is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across a range of visual tasks. |
| PE Core L14 336 | facebook | Apache-2.0 | Text-to-Image | 11.52k | 34 | A large-scale visual encoder developed by Meta, reaching state-of-the-art performance on various vision tasks through contrastive pre-training and fine-tuning on synthetic video data. |
| Sail Clip Hendrix 10epochs | cringgaard | | Text-to-Image, Transformers | 49 | 0 | A vision-language model fine-tuned from openai/clip-vit-large-patch14 for 10 epochs. |
| Git RSCLIP | lcybuaa | Apache-2.0 | Text-to-Image, Safetensors | 59.37k | 4 | Git-RSCLIP is a vision-language model pretrained on the Git-10M dataset, specializing in multimodal understanding of remote sensing images. |
| Vit SO400M 14 SigLIP2 | timm | Apache-2.0 | Text-to-Image | 1,178 | 0 | A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification. |
| Eva02 Large Patch14 Clip 336.merged2b | timm | MIT | Text-to-Image | 197 | 0 | EVA02 CLIP is a large-scale vision-language model based on the CLIP architecture, supporting tasks such as zero-shot image classification. |
| Brahmai Clip V0.1 | brahmairesearch | MIT | Text-to-Image, Transformers, English | 12.53k | 0 | A CLIP model combining a ViT-L/14 image encoder with a masked self-attention Transformer text encoder, intended for zero-shot image classification research. |
| Resnet50x64 Clip.openai | timm | MIT | Image Classification | 622 | 0 | A CLIP model based on the ResNet50x64 architecture from the OpenCLIP library, supporting zero-shot image classification. |
| Fashion Embedder | McClain | MIT | Text-to-Image, Transformers, English | 58 | 0 | FashionCLIP is a CLIP-based vision-language model fine-tuned for the fashion domain, capable of producing general-purpose fashion product representations. |
| FLIP Base 32 | FLIP-dataset | Apache-2.0 | Multimodal Fusion, Transformers | 16 | 0 | A vision-language model based on the CLIP architecture, post-trained on 80 million face images. |
| Clip Vit Base Patch32 | Xenova | | Text-to-Image, Transformers | 177.13k | 8 | A CLIP model developed by OpenAI, based on the Vision Transformer architecture, supporting joint understanding of images and text. |
| CLIP ViT B 16 DataComp.L S1b B8k | laion | MIT | Text-to-Image | 1,166 | 1 | A zero-shot image classification model based on the CLIP architecture, trained on the DataComp dataset and supporting efficient image-text matching. |
| CLIP ViT B 16 CommonPool.L.clip S1b B8k | laion | MIT | Text-to-Image | 138 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 16 CommonPool.L.laion S1b B8k | laion | MIT | Text-to-Image | 106 | 0 | A CLIP-architecture vision-language model trained on the laion-s1B-b8K subset, supporting zero-shot image classification. |
| CLIP ViT B 16 CommonPool.L.text S1b B8k | laion | MIT | Text-to-Image | 58 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 16 CommonPool.L S1b B8k | laion | MIT | Text-to-Image | 517 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 DataComp.M S128m B4k | laion | MIT | Text-to-Image | 212 | 0 | A CLIP-architecture vision-language model trained on the DataComp.M dataset, supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M.laion S128m B4k | laion | MIT | Text-to-Image | 65 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M.image S128m B4k | laion | MIT | Text-to-Image | 73 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M.text S128m B4k | laion | MIT | Text-to-Image | 68 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M.basic S128m B4k | laion | MIT | Text-to-Image | 67 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.M S128m B4k | laion | MIT | Text-to-Image | 79 | 0 | A zero-shot image classification model based on the CLIP architecture, supporting general vision-language tasks. |
| CLIP ViT B 32 DataComp.S S13m B4k | laion | MIT | Text-to-Image | 92 | 0 | A zero-shot image classification model based on the CLIP architecture, trained on the DataComp dataset. |
| CLIP ViT B 32 CommonPool.S.clip S13m B4k | laion | MIT | Text-to-Image | 68 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| CLIP ViT B 32 CommonPool.S S13m B4k | laion | MIT | Text-to-Image | 79 | 0 | A CLIP-architecture vision-language model supporting zero-shot image classification. |
| Eva02 Enormous Patch14 Clip 224.laion2b S4b B115k | timm | MIT | Text-to-Image | 130 | 1 | A large-scale vision-language model based on the EVA02 architecture, supporting zero-shot image classification. |
| Align Base | kakaobrain | | Multimodal Alignment, Transformers, English | 78.28k | 25 | ALIGN is a dual-encoder vision-language model that aligns image and text representations through contrastive learning, achieving state-of-the-art cross-modal representations from large-scale noisy data. |
| Fashion Clip | patrickjohncyh | MIT | Text-to-Image, Transformers, English | 3.8M | 222 | FashionCLIP is a CLIP-based vision-language model fine-tuned for the fashion domain, capable of producing general-purpose product representations. |
| Altclip | BAAI | OpenRAIL | Text-to-Image, Transformers, Multilingual | 12.78k | 28 | AltCLIP is a simple and efficient bilingual CLIP model supporting Chinese-English text-image representation tasks. |
| Vit Base Patch16 Clip 224.openai | timm | Apache-2.0 | Text-to-Image, Transformers | 618.17k | 7 | CLIP is a vision-language model developed by OpenAI that trains image and text encoders contrastively, supporting zero-shot image classification. |
| Biomedvlp CXR BERT General | microsoft | MIT | Large Language Model, Transformers, English | 12.31k | 37 | CXR-BERT is a specialized language model for the chest X-ray domain, optimized for radiology text through an improved vocabulary and pretraining procedure. |
| Clip Rsicd | flax-community | | Text-to-Image | 146 | 4 | A remote-sensing model fine-tuned from OpenAI CLIP, improving zero-shot classification and image retrieval. |
| Clip Vit Base Patch16 | openai | | Image-to-Text | 4.6M | 119 | CLIP is a multimodal model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, enabling zero-shot image classification. |
| Clip Rsicd V2 | flax-community | | Text-to-Image | 3,229 | 23 | A remote-sensing model fine-tuned from OpenAI CLIP, improving zero-shot classification and cross-modal retrieval. |